Ensemble Techniques

Aryan Jain

Data Description and Context: Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone speech (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and the risk of dementia is increased. Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.

Domain: Medicine

Learning Outcomes:

Objective: The goal is to classify patients into their respective labels using the attributes extracted from their voice recordings.

Import Python Libraries

---------------------------- Exploratory Data Analysis (EDA) -----------------------

Loading the dataset

Note: This shows only 20 of the 24 columns. Let's fix that so it shows all of them.
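The notebook's original cell is not preserved here, so the following is only a minimal sketch of one common way to show every column in pandas. The `demo` frame and its `col_i` column names are placeholders, not the real dataset.

```python
import pandas as pd

# Remove the cap on displayed columns (a notebook typically truncates wide frames).
pd.set_option("display.max_columns", None)

# Placeholder frame with 24 columns, standing in for the loaded dataset.
demo = pd.DataFrame([range(24)], columns=[f"col_{i}" for i in range(24)])
print(demo.head())
```

Setting the option to `None` removes the limit entirely; an integer such as 30 would also be enough for 24 columns.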

Eye-ball raw data / Understand the dataset

Note:

Checking missing values

Note:
There are no missing values in the dataset.
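A check of this kind can be sketched as below. The frame and its column names (`feature_a`, `feature_b`) are made-up stand-ins; only `status` is a name taken from this notebook.

```python
import pandas as pd

# Placeholder data; the real notebook would use the loaded voice-recording dataset.
df = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3],
    "feature_b": [1.0, 2.0, 3.0],
    "status": [1, 0, 1],
})

# Count missing entries per column, then across the whole frame.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```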

Count unique values

Note
The column "name" appears to be a unique identifier for the data rows and has no relation to the target variable, so we will drop "name".
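As a rough illustration of that reasoning, the sketch below confirms that a column holds one distinct value per row before dropping it. The data values and the `feature_a` column are invented for the example.

```python
import pandas as pd

# Placeholder rows; only the column names "name" and "status" come from the notebook.
df = pd.DataFrame({
    "name": ["rec_1", "rec_2", "rec_3"],
    "feature_a": [0.1, 0.2, 0.3],
    "status": [1, 0, 1],
})

# One distinct value per row means the column only identifies the row.
is_identifier = df["name"].nunique() == len(df)

if is_identifier:
    df = df.drop(columns=["name"])

print(df.columns.tolist())
```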

Note

----------------------------------- Data Visualization --------------------------------

1. Data distribution of each attribute: univariate analysis
2. Influence of important features, based on the pair plot and correlation heatmap: multivariate analysis

Univariate analysis

All features are numeric, so I will use distplot for each variable, split by the target (status) values.

Basic Statistics / Understand data & spread

Note

Bivariate analysis

If any two independent variables have a correlation above 0.99, I will drop one of them, on the assumption that the two carry essentially the same information and any remaining difference is noise.
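One possible way to apply this rule is sketched below. The helper name `drop_highly_correlated` is hypothetical, the demo data is synthetic, and using the absolute value of the correlation is an assumption (the rule above does not say whether strongly negative correlations should count).

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features, threshold=0.99):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()  # assumption: absolute correlation
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Synthetic example: "a_copy" is almost a linear copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 1e-6 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(demo)
print("Dropped:", dropped)
```

Within each correlated pair, the column that appears first in the frame is kept, so the result depends on column order.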

Observations

Based on Pair plot & Correlation map we can infer the association between the attributes and target column:

Vocal fundamental frequency

Tonal component of frequency

Variation in amplitude

Influence of important features on status

Correlation


Others

------------------------- Model Building ---------------------------------

Dimensionality Reduction

Dropping highly correlated independent variables

We now have only 19 independent features, and 1 target variable

Separate, Scale & Prepare

I will scale the features, as is standard practice, so that attributes measured on larger numeric ranges do not dominate the models.

Splitting the data into training and test sets in the ratio of 70:30
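The original cells are not shown, so the following is only a sketch of scaling together with a 70:30 split, using synthetic data. Fitting the scaler on the training portion only, the stratified split, and the `random_state` value are assumptions made for this example, not necessarily what the notebook did.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 rows, 4 features, balanced binary target.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y = np.array([0, 1] * 50)

# 70:30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Learn mean and standard deviation from the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler before splitting would let test-set statistics influence training, which is why this sketch splits first.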

Standard algorithms & accuracy on test data

Training & Testing models

run_model function
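The body of `run_model` is not preserved in this text. A helper of this kind might look like the sketch below; the signature, the returned dictionary, and the choice of accuracy as the only metric are all assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_model(model, X_train, y_train, X_test, y_test):
    """Fit the model and report accuracy on the training and test data."""
    model.fit(X_train, y_train)
    return {
        "model": type(model).__name__,
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    }

# Synthetic binary classification problem standing in for the real data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

result = run_model(LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test)
print(result)
```

Returning both training and test accuracy makes it easy to see whether a model is overfitting.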

The steps below build, train and test various models.

Train a Meta-classifier - Stacking & Random Forest
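As an illustration only, a stacking meta-classifier and a random forest could be trained with scikit-learn as sketched here. The base learners, the logistic-regression meta-learner, the hyperparameters, and the synthetic data are assumptions; the notebook's actual configuration (and even the library it used for stacking) is not shown in this text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the scaled voice features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Stacking: base learners produce cross-validated predictions,
# and the final estimator learns how to combine them.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)

# Random forest on its own, for comparison.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
forest_accuracy = forest.score(X_test, y_test)

print("Stacking test accuracy:", stack_accuracy)
print("Random forest test accuracy:", forest_accuracy)
```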

Performance Comparison

The number of models was selected as per the instructions in the project file.

Performance details of all models are displayed in the table below.

Final Conclusion

Best Model - Meta Classifier (Stacking)

Based on the performance summary table above, the meta-model (stacking) performed best across all performance metrics. Additionally, its performance is well balanced between training and test data.

---------------------------------------END----------------------------------------------